In this post I'll show you how to simulate a large dataset of mortality claims. Simulated data is often useful: if you want to experiment with new modelling techniques but can't find a suitable dataset, it may be easiest to create your own, and you may also wish to supplement existing data with simulated records. Given some assumptions about the variables you want to create, it's possible to build entire datasets from scratch.

Example

Imagine that you wanted to experiment with fitting different predictive models to life insurance claims data (perhaps to practice your coding skills or to explore an idea) but didn't have all the data required. Rather than spend time searching for or collecting large volumes of data appropriate for your specific problem, you could simply (and quickly) create a simulated dataset.

I will show you how to create a dataset representing a portfolio of lives with lump sum life insurance cover, but the following techniques can be used to create all sorts of data. We will end up with a large dataset where each row represents one life and the columns contain information about that person's age, salary, occupation, gender, location and, finally, whether or not that person died during a specific period.

In order to do this we will need to come up with some assumptions about the variables we wish to create. The assumptions I will make will be very simple and are listed below.

Age: Normally distributed with mean age = 40 and variance = 25
Gender: 40% of the sample will be male, 60% female
Occupation: 3 categories representing different occupations. 50% of the lives will be in category 1, 20% in category 2 and 30% in category 3
Location: 4 categories representing different areas of residence. 10% will be in category 1, 20% in category 2, 50% in category 3 and 20% in category 4
Salary: Gamma distributed. I will assume that salary varies by age and occupation
Exposure: This variable represents the length of time (in years) that each person was covered for. This will come from a uniform(0,3) distribution (i.e. the maximum cover period is 3 years)
Claim: I will assume that the probability of death is given by a formula linked to age, occupation and location. The base level of mortality will be $\frac{Age}{20000}$. Mortality increases by 20% for those in occupation category 1 and by 15% for those living in locations 3 or 4
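
To make the claim assumption concrete: a 40-year-old in occupation category 1 living in location 3 would have a death probability of $\frac{40}{20000} \times 1.2 \times 1.15 = 0.00276$ under these assumptions.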

Creating the Data in R

Setting Assumptions

I'll assume you have RStudio installed and some experience with basic R programming. If you have not used R before, I recommend visiting https://www.datacamp.com and taking the free introduction to R course. We will be using the simstudy package. If you have not yet installed this package, run install.packages('simstudy') to do so.

simstudy first requires us to generate a table containing our assumptions. Each row in this table will contain information relating to one of our variables: a name for the variable, the distribution to simulate from and a formula for the mean (and, in some cases, a variance). To create the assumptions table, use the defData() function.

First, load the necessary libraries.

library(simstudy)
library(ggplot2)
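
Since the data will be generated randomly, you may also want to fix the random seed so that repeated runs produce identical results (the seed value below is arbitrary):

# optional: set a seed for reproducible simulations
set.seed(1234)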
# initialise our assumptions table and add the Age variable
def = defData(varname = 'Age', dist = 'normal', formula = 40, variance = 25)

# the assumptions table has now been created and can be viewed
def
varname  formula  variance  dist    link
Age      40       25        normal  identity

To add to this table, use the defData() function again, but this time passing the table we've just created via the dtDefs argument. For example, add the gender, occupation, location and exposure variables as follows:

# add a binary variable to indicate gender - 40% of lives will be male
# and will be represented by 1s
def = defData(dtDefs = def, varname = "Gender", dist = "binary", formula = 0.4)

# add a categorical variable to define occupation type.
# There will be three classes.
def = defData(dtDefs = def, varname = "Occupation", dist = "categorical", formula = "0.5;0.2;0.3")

# add the location variable to the data table
def = defData(dtDefs = def, varname = "Location", dist = "categorical", formula = "0.1;0.2;0.5;0.2")

# add a variable to indicate 'exposure years'
# (in other words, the length of time an individual was insured)
def = defData(def, varname = 'Exposure', dist = "uniform", formula = "0;3")

# the assumptions table now looks like:
def
varname     formula          variance  dist         link
Age         40               25        normal       identity
Gender      0.4              0         binary       identity
Occupation  0.5;0.2;0.3      0         categorical  identity
Location    0.1;0.2;0.5;0.2  0         categorical  identity
Exposure    0;3              0         uniform      identity

Since salary and claim depend on other variables, the 'formula' argument for these variables will be a bit more complicated. I have used the paste() function to specify the formulas to be passed into defData().

# add a salary variable - the mean of which will increase 
# with age and will vary by occupation class.
# Use a gamma distribution

salary_formula = paste("ifelse(Occupation == 1, Age * 800, 
                        ifelse(Occupation == 2, Age * 1200, Age * 1350))")

def = defData(def, varname = "Salary", dist = "gamma",
              formula = salary_formula, variance = 0.2)

# Now, add the binary claim indicator
# (the probability of death varies as specified earlier)
claim_formula = paste("(Age * 1/20000)",
                      " * ifelse(Occupation == 1,1.2,1)",
                      " * ifelse(Location %in% c(3,4), 1.15,1)",
                      sep = "")

def = defData(def, varname = "Claim", dist = "binary", formula = claim_formula)

# the final assumptions table now looks like:
def
varname     formula                                                                                 variance  dist         link
Age         40                                                                                      25.0      normal       identity
Gender      0.4                                                                                     0.0       binary       identity
Occupation  0.5;0.2;0.3                                                                             0.0       categorical  identity
Location    0.1;0.2;0.5;0.2                                                                         0.0       categorical  identity
Exposure    0;3                                                                                     0.0       uniform      identity
Salary      ifelse(Occupation == 1, Age * 800, ifelse(Occupation == 2, Age * 1200, Age * 1350))     0.2       gamma        identity
Claim       (Age * 1/20000) * ifelse(Occupation == 1,1.2,1) * ifelse(Location %in% c(3,4), 1.15,1)  0.0       binary       identity

Generating the Data

Now, actually generating the data is easy. Do this using the genData() function. The first argument, n, is the number of records you want to simulate; then just pass in the assumptions table. I will simulate 1.5 million records.

# simulate 1.5m records
dt = genData(n = 1500000, def)

# view the first few rows
head(dt)
id  Age       Gender  Occupation  Location  Exposure   Salary    Claim
1   42.08506  1       3           1         2.7495713  38489.57  0
2   45.20828  0       2           2         1.2410225  35048.71  0
3   31.44311  0       1           3         0.6282520  20921.03  0
4   49.38805  1       2           4         0.2876624  94668.40  0
5   33.44234  1       1           4         2.3408016  15132.47  0
6   42.73697  1       2           2         0.5534407  60448.24  0
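
As a quick sanity check, you can confirm that the dataset contains the requested number of rows:

# confirm the number of simulated records
nrow(dt)  # 1500000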

Exploring the Simulated Data

Now we can summarise and visualise the dataset to check that everything is as expected. The summary() and table() functions are very useful, and the ggplot2 library is particularly good for creating visualisations. Use the ggplot() function to create density plots, bar charts and scatter plots to check that the distributions match our specified assumptions and to check the relationships between variables. If you have not installed ggplot2, run install.packages('ggplot2').

First, use summary() to see statistics for the entire dataset. Then use table() and prop.table() to see how claims vary by occupation and/or location.

# Summary stats
summary(dt)

print("Claim rate by occupation")
# prop table can be used to convert the tables above to percentages
print("Occupation")
round(prop.table(table(dt$Occupation, dt$Claim), margin = 1), 5)
       id               Age            Gender         Occupation   
 Min.   :      1   Min.   :14.35   Min.   :0.0000   Min.   :1.000  
 1st Qu.: 375001   1st Qu.:36.64   1st Qu.:0.0000   1st Qu.:1.000  
 Median : 750000   Median :40.00   Median :0.0000   Median :1.000  
 Mean   : 750000   Mean   :40.00   Mean   :0.4007   Mean   :1.799  
 3rd Qu.:1125000   3rd Qu.:43.38   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :1500000   Max.   :63.35   Max.   :1.0000   Max.   :3.000  
    Location      Exposure             Salary             Claim         
 Min.   :1.0   Min.   :0.0000012   Min.   :   825.3   Min.   :0.000000  
 1st Qu.:2.0   1st Qu.:0.7498038   1st Qu.: 25570.8   1st Qu.:0.000000  
 Median :3.0   Median :1.5007302   Median : 37225.3   Median :0.000000  
 Mean   :2.8   Mean   :1.5002358   Mean   : 41795.1   Mean   :0.002397  
 3rd Qu.:3.0   3rd Qu.:2.2502329   3rd Qu.: 53107.6   3rd Qu.:0.000000  
 Max.   :4.0   Max.   :2.9999990   Max.   :292813.0   Max.   :1.000000  
[1] "Claim rate by occupation"
[1] "Occupation"
   
          0       1
  1 0.99738 0.00262
  2 0.99785 0.00215
  3 0.99781 0.00219
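
We can do the same for location. Given the 15% loading we applied, locations 3 and 4 should show noticeably higher claim rates than locations 1 and 2 (output not shown):

print("Claim rate by location")
round(prop.table(table(dt$Location, dt$Claim), margin = 1), 5)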

Next, we can use density plots to check that ages are normally distributed and that salaries are a) gamma distributed and b) vary by occupation.

# view distribution of ages
ggplot(dt, aes(x=Age)) + geom_density() + theme_bw()
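
Alongside the visual check, a quick numeric check is easy: the sample mean and standard deviation should sit close to the assumed values (mean 40, standard deviation $\sqrt{25} = 5$).

# numeric sanity check on the simulated ages
mean(dt$Age) # should be close to 40
sd(dt$Age)   # should be close to 5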

It looks like age has been simulated as expected. The curve is bell shaped, the mean age is clearly 40 and most people are aged between 30 and 50, which is in line with the variance we specified. To check that salary has been correctly simulated, repeat the ggplot() call but replace Age with Salary in the aes() (aesthetic) argument. We will also add a facet_wrap() layer to generate a separate density plot for each occupation group.

# view distribution of salary by occupation class
ggplot(dt, aes(x = Salary)) + 
  geom_density() + 
  facet_wrap(~Occupation) +
  theme_bw() +
ggtitle("Salary Dist. by Occupation Category")

These look like gamma distributions and salary seems to be highest for occupation category 3 - which is correct!
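
We can confirm this numerically too: given the salary formulas, the mean salary in each category should be roughly the mean age (40) multiplied by 800, 1200 and 1350 respectively.

# mean salary by occupation category
tapply(dt$Salary, dt$Occupation, mean)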

In our assumptions table we stated that salary should increase with age. We can check this using a scatter plot. Use ggplot() again, but with geom_point() instead of geom_density(). Also, add y = Salary to the aesthetic.

# plot sample of age against salary
# (plot the first 1000 rows only, for a clearer visual; 'pch = 21' controls the point style)
ggplot(dt[1:1000,], aes(x = Age, y = Salary)) + geom_point(pch = 21) + theme_bw() + 
ggtitle("Age/Salary Scatter")

Finally, view a bar chart showing counts of each location category. This time use geom_bar() and add facet_wrap() to view separate bar charts for each occupation class.

# distribution of locations by occupation
ggplot(dt, aes(x = factor(Location))) + geom_bar() + theme_bw() + 
  facet_wrap(~Occupation) + 
xlab("Location") + 
ggtitle("Location by Occupation Category")

Creating New Features

This last section will demonstrate how to add some simple extra features to the dataset, derived from the existing variables. Currently, age is exact, so we have lots of unique values. To reduce the number of distinct values I will create an 'Age_Last' variable (age last birthday). In addition, it is sometimes useful to transform numerical variables into categorical ones, so I will also show you how to create a 'Salary_Band' feature from the numerical salary variable. This will group salary into a number of categories which we will treat as nominal (you may also wish to treat these as ordinal factors).

# create an 'Age_Last' variable from the exact Age we simulated earlier
dt$Age_Last = floor(dt$Age)

# we can also create a categorical variable from the raw numeric 
# salary variable - e.g group into bins
# We will create groups using a helper function and custom defined breaks
# Bins can include, for example, those earning <25k, 25-35k, 35-45k, 45-75k,75k+
# (I've just arbitrarily picked these)

quant_bin = function(x, breaks){
  # cut() assigns each value to an interval; as.numeric() converts
  # the resulting factor into an integer bin index
  bin = as.numeric(cut(x,
                       breaks = breaks,
                       include.lowest = TRUE, right = TRUE))
  bin
}

breaks = c(0,25000,35000,45000,75000, max(dt$Salary))

dt$Salary_Band = quant_bin(dt$Salary, breaks)

# Now, view the first few rows of the data again
head(dt)
id  Age       Gender  Occupation  Location  Exposure   Salary    Claim  Age_Last  Salary_Band
1   37.19762  1       3           4         2.8019758  61010.78  0      37        4
2   38.84911  1       1           3         2.4055016  58132.73  0      38        4
3   47.79354  0       1           3         2.8939270  42853.31  0      47        3
4   40.35254  1       1           3         0.5860885  59369.07  0      40        4
5   40.64644  1       3           3         1.3679983  58268.46  0      40        4
6   48.57532  1       1           1         0.3328969  22582.35  0      48        1
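
As a final check, count the number of lives in each salary band to make sure every band is populated and the split looks sensible against the salary quartiles from summary() earlier:

# count of lives in each salary band
table(dt$Salary_Band)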

Summary

In summary, you should now be able to:

  • Generate a table of assumptions
  • Simulate a large dataset using the assumptions table
  • Check summary statistics and visualise the data
  • Create some simple new features
